Fault Tolerant Supercomputing: A Software Approach
Authors
Abstract
Adding fault tolerance to embedded supercomputing applications is becoming an issue of great significance, especially as these applications support critical parts of our everyday life in the modern “Information Society”. To this end, a software middleware framework is presented that features a collection of flexible and reusable fault tolerance modules acting at different levels and coping with common fault tolerance requirements. The burden of ad hoc fault tolerance programming is removed from the application developer, while at the same time average fault tolerance support taken at operating system level is avoided. A high-level description helps the developer specify the fault tolerance strategies of the application as a sort of second application layer; this separates functional from fault tolerance aspects of an application, shortening the development cycle and improving maintainability. Integration of this functionality in real embedded applications validates this approach.
Key-Words: Software fault tolerance, high performance computing, embedded parallel and distributed systems, fault-tolerant communication, user-specified recovery strategies, maintainability, separation of design concerns.
Similar resources
EFTOS: A Software Framework for More Dependable Embedded HPC Applications
Within the ESPRIT project EFTOS (Embedded Fault-Tolerant Supercomputing), a framework is developed to integrate fault tolerance flexibly and easily into distributed embedded HPC applications. This framework consists of a variety of reusable fault tolerance modules acting at different levels. The cost and performance overhead of generic Operating System and Hardware level fault tolerance mechan...
Software Tool Combining Fault Masking with User-Defined Recovery Strategies
We describe the voting farm, a tool which implements a distributed software voting mechanism for a number of parallel message passing systems. The tool, developed in the framework of EFTOS (Embedded Fault-Tolerant Supercomputing), can be used in stand-alone mode or in conjunction with other EFTOS fault tolerance tools. In the former case, we describe how the mechanism can be exploited, e.g., to...
Design of an Active Approach for Detection, Estimation and Short-Circuit Stator Fault Tolerant Control in Induction Motors
Three-phase induction motors have many applications in industry. Consequently, detecting and estimating a fault, and compensating for it so that the faulty induction motor still satisfies the predefined goals, are important issues. One of the most common faults in induction motors is a short circuit of the stator winding. In this paper, an active fault-tolerant control system is designed and pres...
The EFTOS Voting Farm: A Software Tool for Fault Masking in Message Passing Parallel Environments
We present a set of C functions implementing a distributed software voting mechanism for EPX or similar message passing environments, and we place it within the EFTOS framework (Embedded Fault-Tolerant Supercomputing, ESPRIT-IV Project 21012) of software tools for enhancing the dependability of a user application. The described mechanism can be used for instance to implement restoring organs i....
Channel Reification: a Reflective Approach to Fault-tolerant Software Development
Reflective systems can be used to ease the implementation of fault tolerance mechanisms in distributed applications, as shown in [Anc95, Fab94]. In this paper we introduce a new model for reflective computations, and we show how it can be used for building up fault tolerant applications.